Spanish Translation A/B Test¶
Company XYZ is a worldwide e-commerce site with localized versions of the site. It was observed that Spain-based users have a much higher conversion rate than users in any other Spanish-speaking country. One possible reason could be poor translation. However, all Spanish-speaking countries were served the same translation as the Spain-based site, written by a Spaniard. Hence, it was agreed to conduct an A/B test in which two versions of the site would be released: one written by a local translator from each country, and the other the original site written by the Spaniard.
After running the test for five days, the results turned out to be negative: the local translation performed worse than the original translation.
The following analysis investigates whether the test was actually negative and, if so, the possible reasons for it.
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
from matplotlib import pyplot as plt
import matplotlib.ticker as ticker
import scipy as sc
from scipy import stats
import sklearn
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from io import StringIO
from inspect import getmembers
%matplotlib inline
Reading in the data
test_table = pd.read_csv('test_table.csv')
user_table = pd.read_csv('user_table.csv')
test_table.head(5)
user_table.head(5)
Before merging the two datasets, checking that user_id is unique in each table
len(test_table) == len(test_table['user_id'].unique())
len(user_table) == len(user_table['user_id'].unique())
Comparing the lengths of both the tables
test_table.shape
user_table.shape
This implies the user_table is missing a few IDs. Therefore, when we perform a join, we shouldn't lose the user IDs in the test table that are not in the user table.
We could do either a left join or an outer join; here we go with an outer join.
data = pd.merge(test_table, user_table, on = 'user_id', how = 'outer')
data.head(5)
data.shape
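As a quick sanity check (a small sketch over the merged frame above), we can confirm that no user IDs from the test table were lost in the join, and count the missing values contributed by the users absent from user_table:
# Every user_id from the test table should survive the outer join
print(set(test_table['user_id']).issubset(set(data['user_id'])))
# Users missing from user_table will show NaNs in the user-level columns
data.isnull().sum()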
Summarizing the data. Getting the basic descriptive statistics.
data.describe()
Some insights from the data so far.¶
- The average conversion rate is roughly 4%, which is in line with industry norms.
- 47% of the users belong to the test group and 53% to the control group.
- This is a fairly young user base, with a mean age of 27 years; 75% of users are 30 or younger.
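A quick verification of these figures, using the columns as named in this dataset:
# Overall conversion rate (~4%)
print(data['conversion'].mean())
# Share of users in test (1) vs. control (0) (~47% / 53%)
print(data['test'].value_counts(normalize = True))
# Mean age (~27) and 75th percentile (~30); NaNs are skipped automatically
print(data['age'].mean(), data['age'].quantile(0.75))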
Checking whether Spain actually converts better than other Spanish-speaking countries
data.groupby('country')[['conversion']].mean()
From the above results it is quite evident that Spain has a conversion rate of nearly 7.9%, whereas the other countries fall in the 4-5% range. Spain indeed has the best conversion rate.
Below is the comparison of the performance in test and control groups. It can be seen that the control group did much better than the test group.
data.groupby('test')[['conversion']].mean()
Below is the comparison of the test and control group without Spain in the picture
data_new = data.copy()
data_new = data_new[data_new['country']!= 'Spain']
data_new.head(5)
data_new.groupby('test')[['conversion']].mean()
Some quick insights¶
From the results it is quite evident that the control and test groups fare similarly, with the control group performing slightly better.
There happened to be more Spaniards in the control group than in the test group. As their conversion rate was higher, the control group had a higher mean; removing them has caused the control and test groups to perform similarly. The check below confirms the imbalance.
Second, for countries other than Spain, it doesn't seem to matter whether the translation is by a local or by a Spaniard; the test results are more or less the same.
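To verify the imbalance mentioned above, a quick look at the fraction of each country's users assigned to the test group:
# Fraction of each country's users that landed in the test group;
# Spain's share should be noticeably below the overall ~47% test rate
data.groupby('country')[['test']].mean()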
Performing a Welch two-sample t-test on the two groups to check whether there is a statistically significant difference in their means
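For reference, the Welch statistic compares the two group means while allowing unequal variances:

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} $$

where $\bar{x}_i$, $s_i^2$ and $n_i$ are the sample mean, variance, and size of group $i$.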
zero = data_new[data_new['test'] == 0]
one = data_new[data_new['test'] == 1]
sc.stats.ttest_ind(zero['conversion'], one['conversion'], equal_var = False, axis = 0)
Insights¶
As the p-value is less than alpha = 0.05, we can reject the null hypothesis. This means we can say, with statistical significance, that the two groups have different means: mean of test ≈ 4.3% and mean of control ≈ 4.8%. This would be a meaningful difference in means, if it were real.
Likely reasons for this include (group sizes are checked below):
- The assignment to the control and test groups was not truly random
- Not enough data for the sample to truly represent the population
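To weigh the second possibility, a quick look at the per-group sample sizes (a sketch over the same data_new frame):
# Sample size and mean conversion per group; if both groups are large,
# insufficient data is an unlikely explanation
data_new.groupby('test')['conversion'].agg(['size', 'mean'])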
Converting the data to a standard format: parsing the date column as datetime
data_new['date'] = pd.to_datetime(data_new['date'])
Plotting to check for any anomalies or biases
time_series = data_new.groupby(['date','test'])[['conversion']].mean()
time_series
time_series = time_series.unstack()['conversion'][1]/time_series.unstack()['conversion'][0]
time_series
fig, ax = plt.subplots(1,1)
ax.plot(time_series)
ax.set_xlabel('Date')
ax.set_ylabel('test/control')
ax.set_title('Line Plot')
ax.xaxis.set_major_locator(ticker.MultipleLocator())
Insights¶
From the above graph it can be seen that, over the course of the five days, control consistently does better than test. Also, the variability in the test/control ratio is low (min ≈ 0.87, max ≈ 0.93; verified below). This suggests the data collected is sufficient; however, there may be a bias in the composition of the control or test group.
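The quoted range of the daily ratio can be read off directly:
# Minimum and maximum of the daily test/control conversion ratio
time_series.min(), time_series.max()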
Likely cause
Some segment of the data with a higher or lower conversion rate has found its way disproportionately into either test or control, thus raising or lowering that group's overall conversion rate.
We can use decision trees to identify this. If the split between test and control is truly random, then the tree shouldn't be able to split well.
X = data_new.copy()
# Encode each categorical column as integers; cast to string first so that
# any NaNs introduced by the outer join become an ordinary category
lb = LabelEncoder()
for col in ['source', 'country', 'device', 'browser_language', 'ads_channel', 'browser', 'sex']:
    X[col] = lb.fit_transform(X[col].astype(str))
X = X.drop(['conversion','date','test'], axis = 1)
y = data_new['test']
X.head(5)
#Checking for missing values
X.isnull().sum()
#Index values of all NAN's
index = X.index[X['age'].isnull()]
index
Below, imputing the missing values in the age column with the column median
impute = SimpleImputer(missing_values = np.nan, strategy = 'median', copy = True)
imputed = DataFrame(impute.fit_transform(X), columns = X.columns)
imputed.head(5)
Creating an instance of the Decision tree classifier below
clf = DecisionTreeClassifier(criterion = 'entropy', max_depth = 2, min_samples_leaf = 2, min_samples_split = 2)
clf.fit(imputed,y)
Understanding the most important features from the classification
clf.feature_importances_
imputed.columns.values
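A more readable pairing of each importance with its column name (a small sketch using the fitted clf):
# Pair each feature with its importance and sort in descending order
Series(clf.feature_importances_, index = imputed.columns).sort_values(ascending = False)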
Insights¶
There seems to be a fair amount of bias in the way the control and test groups were separated, as highlighted by the feature importances: country is an important feature for separating the test and control groups. Therefore, the assignment was not truly random.
# (feature, split threshold, left child, right child) for each node in the tree
list(zip(imputed.columns[clf.tree_.feature], clf.tree_.threshold, clf.tree_.children_left, clf.tree_.children_right))
clf.tree_.children_left
imputed['country'].unique()
data_new['country'].unique()
a = data_new.groupby(['country','test'])[['conversion']].mean().unstack()
b = data_new.groupby('country')[['test']].mean()
df = pd.concat([a,b], axis = 1)
# Split the data into control (test == 0) and test (test == 1) groups
temp1 = data_new[data_new['test'] == 0]
temp2 = data_new[data_new['test'] == 1]
# Collect the per-country conversion series for each group
a = []; b = []; c = []; d = []
for i, j in temp1.groupby('country')['conversion']:
    a.append(i)
    b.append(j)
for i, j in temp2.groupby('country')['conversion']:
    c.append(i)
    d.append(j)
# Welch t-test per country: control vs. test conversions
p_value = []
for i in np.arange(len(a)):
    p_value.append(sc.stats.ttest_ind(b[i], d[i], equal_var = False, axis = 0)[1])
df = pd.concat([df, DataFrame(p_value, index = a)], axis = 1)
df.columns = ['mean in control', 'mean in test', '%samples in test group', 'p_value']
df
Conclusions¶
As can be seen, the p-value associated with every country is much greater than 0.05.
This implies we fail to reject the null hypothesis: there is no statistically significant difference in mean conversion between the test and control groups within any single country.
However, Argentina and Uruguay have the lowest conversion rates, at roughly 1%. Also, nearly 80% of the samples from these two countries found their way into the test group and only 20% into the control group, as can be verified from the third column of the above dataframe.
Due to their low conversion rates and the large influx of their samples into the test group, there was an apparent difference between the overall test and control conversion rates; it was because of this that the mean of the test group was so much lower than the mean of the control group.
It is now clear that the A/B test result was not significant: once the country imbalance is accounted for, the test and control groups perform similarly, and the local translation did not affect the conversion rate.
Side Note
- Argentina and Uruguay have the lowest conversion rates
- Marketing efforts could be directed at these two countries to improve their conversion rates